10 research outputs found

    Virtualized execution runtime for FPGA accelerators in the cloud

    Get PDF
    FPGAs offer high performance coupled with energy efficiency, making them extremely attractive computational resources within a cloud ecosystem. However, to achieve this integration and make them easy to program, we first need to enable users with varying expertise to easily develop cloud applications that leverage FPGAs. With the growing size of FPGAs, allocating them monolithically to users can be wasteful due to potentially low device utilization. Hence, we also need to be able to dynamically share FPGAs among multiple users. To address these concerns, we propose a methodology and a runtime system that together simplify the FPGA application development process by providing: 1) a clean abstraction with high-level APIs for easy application development; 2) a simple execution model that supports both hardware and software execution; and 3) a shared memory model that is convenient for programmers to use. Akin to an operating system on a computer, our lightweight runtime system enables the simultaneous execution of multiple applications by virtualizing computational resources, i.e., FPGA resources and on-board memory, and offers protection facilities to isolate applications from each other. In this paper, we illustrate how these features can be developed in a lightweight manner and quantitatively evaluate the performance overhead they introduce on a small set of applications running on our proof-of-concept prototype. Our results demonstrate that these features only introduce marginal performance overheads. More importantly, by sharing resources for simultaneous execution of multiple user applications, our platform improves FPGA utilization and delivers higher aggregate throughput compared to accessing the device in a time-shared manner.
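
    To make the three ingredients above concrete, here is a minimal C++ sketch of what such a high-level API could look like. All names (Runtime, alloc_shared, submit, Target) are invented for illustration and are not the actual interface described in the paper; only the software execution path is implemented so the sketch compiles and runs as-is.

    // Illustrative sketch only: shows the shape of a high-level FPGA runtime
    // API (shared buffers, task submission, hardware/software targets).
    #include <cstdlib>
    #include <functional>
    #include <iostream>

    enum class Target { Hardware, Software };

    struct Runtime {
        // Buffer in the shared memory space visible to the host and, on real
        // hardware, to the FPGA accelerator.
        void* alloc_shared(std::size_t bytes) { return std::malloc(bytes); }
        void  free_shared(void* p)            { std::free(p); }

        // Submit a kernel; only the software fallback is implemented here.
        void submit(Target t, const std::function<void()>& sw_kernel) {
            if (t == Target::Software) sw_kernel();
            // else: hand the task to the FPGA scheduler in a real runtime
        }
    };

    int main() {
        Runtime rt;
        const std::size_t n = 16;
        auto* data = static_cast<int*>(rt.alloc_shared(n * sizeof(int)));
        for (std::size_t i = 0; i < n; ++i) data[i] = static_cast<int>(i);

        // The same call site would serve Target::Hardware if a bitstream existed.
        rt.submit(Target::Software, [&] {
            for (std::size_t i = 0; i < n; ++i) data[i] *= 2;
        });

        std::cout << "data[5] = " << data[5] << "\n";  // prints 10
        rt.free_shared(data);
    }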

    DynaBurst: Dynamically Assemblying DRAM Bursts over a Multitude of Random Accesses

    No full text
    The effective bandwidth of the FPGA external memory, usually DRAM, is extremely sensitive to the access pattern. Nonblocking caches that handle thousands of outstanding misses (miss-optimized memory systems) can dynamically improve bandwidth utilization whenever memory accesses are irregular and application-specific optimizations are not available or are too costly in terms of design time. However, they require a memory controller with wide data ports on the FPGA side and cannot fully take advantage of the memory interfaces with multiple narrow ports that are common on SoC FPGAs. Moreover, as their scope is limited to single memory requests, the access pattern they generate may cause frequent DRAM row conflicts, which further reduce DRAM bandwidth. In this paper, we propose DynaBurst, an extension of miss-optimized memory systems that generates variable-length bursts to the memory controller. By making memory accesses locally more sequential, we minimize the number of DRAM row conflicts, and by adapting the burst length on a per-request basis we minimize bandwidth wastage. On a multiple, narrow-ported DDR3 controller, we achieve a 28% geometric-mean speedup, and up to 3.4x, compared to a traditional nonblocking cache of the same area, where the prior single-request approach would not have been cost-effective. On a controller with a single, wide port, we can further improve the performance of miss-optimized systems by up to 2.4x.
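
    The core idea of adapting the burst length per request can be illustrated in software. The C++ sketch below is illustrative pseudologic, not DynaBurst's hardware implementation: it groups pending miss addresses that cluster in the address space into one burst, growing the burst only while the fraction of fetched-but-unrequested lines stays below an assumed wastage threshold.

    // Hedged sketch: variable-length burst assembly over pending misses.
    // Names, thresholds, and the software data structures are assumptions.
    #include <algorithm>
    #include <cstdint>
    #include <iostream>
    #include <vector>

    struct Burst { uint64_t start; unsigned length; };  // length in cache lines

    std::vector<Burst> assemble_bursts(std::vector<uint64_t> miss_lines,
                                       unsigned max_len, double max_waste) {
        std::sort(miss_lines.begin(), miss_lines.end());
        std::vector<Burst> bursts;
        for (std::size_t i = 0; i < miss_lines.size();) {
            std::size_t j = i + 1;
            // Grow the burst while it stays short enough and dense enough
            // (i.e. few fetched lines that nobody actually requested).
            while (j < miss_lines.size()) {
                unsigned span  = static_cast<unsigned>(miss_lines[j] - miss_lines[i] + 1);
                double   waste = 1.0 - double(j - i + 1) / span;
                if (span > max_len || waste > max_waste) break;
                ++j;
            }
            unsigned len = static_cast<unsigned>(miss_lines[j - 1] - miss_lines[i] + 1);
            bursts.push_back({miss_lines[i], len});
            i = j;
        }
        return bursts;
    }

    int main() {
        // Misses at lines 10..12 merge into one burst; 100 stays a single access.
        for (auto b : assemble_bursts({10, 12, 11, 100}, 8, 0.5))
            std::cout << "burst @" << b.start << " len " << b.length << "\n";
    }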

    FPGAs in the Datacenters: the Case of Parallel Hybrid Super Scalar String Sample Sort

    No full text
    String sorting is an important part of database and MapReduce applications; however, it has not been studied as extensively as sorting of fixed-length keys. Handling variable-length keys in hardware is challenging, and it is no surprise that no string sorters on FPGA have been proposed yet. In this paper, we present Parallel Hybrid Super Scalar String Sample Sort (pHS⁵) on Intel HARPv2, a heterogeneous CPU-FPGA system with a server-grade multi-core CPU. Our pHS⁵ is based on the state-of-the-art string sorting algorithm for multi-core shared memory CPUs, pS⁵, which we extended with multiple processing elements (PEs) on the FPGA. Each PE accelerates one instance of the most effectively parallelizable dominant kernel of pS⁵ by up to 33% compared to a single Intel Xeon Broadwell core running at 3.4 GHz. Furthermore, we extended the job scheduling mechanism of pS⁵ to enable our PEs to compete with the CPU cores for processing the accelerable kernel, while retaining the complex high-level control flow and the sorting of the smaller data sets on the CPU. We accelerate the whole algorithm by up to 10% compared to the 28-thread software baseline running on the 14-core Xeon processor and by up to 36% at lower thread counts.
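
    The scheduling idea, with FPGA PEs and CPU cores competing for instances of the same accelerable kernel, can be pictured as two kinds of workers draining one shared job queue. The C++ sketch below is a hedged illustration with invented names (JobQueue, Job, worker); it is not the paper's actual pS⁵ scheduler code.

    // Hedged sketch: CPU and FPGA-PE workers pull from the same job queue.
    #include <iostream>
    #include <mutex>
    #include <optional>
    #include <queue>
    #include <thread>

    struct Job { int bucket_id; };  // e.g. one string bucket to process

    class JobQueue {
        std::queue<Job> q_;
        std::mutex m_;
    public:
        void push(Job j) { std::lock_guard<std::mutex> g(m_); q_.push(j); }
        std::optional<Job> pop() {
            std::lock_guard<std::mutex> g(m_);
            if (q_.empty()) return std::nullopt;
            Job j = q_.front(); q_.pop(); return j;
        }
    };

    void worker(JobQueue& q, const char* who) {
        while (auto j = q.pop())
            std::cout << who << " processed bucket " << j->bucket_id << "\n";
    }

    int main() {
        JobQueue q;
        for (int i = 0; i < 8; ++i) q.push({i});
        // One thread stands in for a CPU core, the other for an FPGA PE driver.
        std::thread cpu(worker, std::ref(q), "CPU core");
        std::thread pe (worker, std::ref(q), "FPGA PE ");
        cpu.join(); pe.join();
    }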

    In Search of Lost Bandwidth: Extensive Reordering of DRAM Accesses on FPGA

    No full text
    For efficient acceleration on FPGA, it is essential for external memory to match the throughput of the processing pipelines. However, the usable DRAM bandwidth decreases significantly if the access pattern causes frequent row conflicts. Memory controllers reorder DRAM commands to minimize row conflicts; however, general-purpose controllers must also minimize latency, which limits the depth of the internal queues over which reordering can occur. For latency-insensitive applications with irregular access patterns, nonblocking caches that support thousands of in-flight misses (miss-optimized memory systems) improve bandwidth utilization by reusing the same memory response to serve as many incoming requests as possible. However, they do not improve the irregularity of the access pattern sent to the memory, meaning that row conflicts remain an issue. Sending out bursts instead of single memory requests makes the access pattern more sequential; however, realistic implementations trade high throughput for some unnecessary data in the bursts, leading to bandwidth wastage that cancels out part of the gains from regularization. In this paper, we present an alternative approach to extend the scope of DRAM row conflict minimization beyond the possibilities of general-purpose DRAM controllers. We use the thousands of future memory requests that spontaneously accumulate inside the miss-optimized memory system to implement an efficient large-scale reordering mechanism. By reordering single requests instead of sending bursts, we regularize the memory access pattern in a way that increases bandwidth utilization without incurring any data wastage. Our solution outperforms the baseline miss-optimized memory system by up to 81% and has better worst, average, and best performance than DynaBurst across 15 benchmarks and 30 architectures.
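
    The essence of large-scale reordering by DRAM row can be shown with a small software model: pending single requests are bucketed by the row they map to, and each row's requests are drained together so the row is activated only once. The C++ sketch below is an assumption-laden illustration (simplified address mapping, invented names), not the paper's hardware design.

    // Hedged sketch: group pending requests by DRAM row to avoid row conflicts.
    #include <cstdint>
    #include <iostream>
    #include <unordered_map>
    #include <vector>

    constexpr uint64_t kRowBytes = 8 * 1024;  // assumed DRAM row size

    void reorder_and_issue(const std::vector<uint64_t>& pending_addrs) {
        std::unordered_map<uint64_t, std::vector<uint64_t>> by_row;
        for (uint64_t a : pending_addrs) by_row[a / kRowBytes].push_back(a);

        for (const auto& [row, addrs] : by_row) {
            std::cout << "open row " << row << ", issue " << addrs.size()
                      << " column accesses\n";  // one activation per row
        }
    }

    int main() {
        // Interleaved accesses to two rows become two conflict-free groups.
        reorder_and_issue({0, 9000, 64, 9064, 128});
    }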

    Through Silicon Vias With Invar Metal Conductor for High-Temperature Applications

    No full text
    Through silicon vias (TSVs) are key enablers of 3-D integration technologies which, by vertically stacking and interconnecting multiple chips, achieve higher performance, lower power, and a smaller footprint. Copper is the most commonly used conductor to fill TSVs; however, copper has a high thermal expansion mismatch in relation to the silicon substrate. This mismatch results in a large accumulation of thermomechanical stress when TSVs are exposed to high temperatures and/or temperature cycles, potentially resulting in device failure. In this paper, we demonstrate 300 µm long, 7:1 aspect ratio TSVs with Invar as the conductive material. The entire TSV structure can withstand at least 100 thermal cycles from −50 °C to 190 °C and at least 1 h at 365 °C, limited by the experimental setup. This is possible thanks to the matching coefficients of thermal expansion of the Invar via conductor and of the silicon substrate. According to finite element modeling, the resulting thermomechanical stresses are one order of magnitude smaller than in copper TSV structures with identical geometries. Our TSV structures are thus a promising approach enabling 2.5-D and 3-D integration platforms for high-temperature and harsh-environment applications.
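
    For intuition on why matching the coefficient of thermal expansion (CTE) matters, a first-order estimate of the mismatch stress in a fully constrained via (a textbook approximation, not the paper's finite element model) is

        \sigma_{\mathrm{th}} \approx \frac{E_{\mathrm{via}}}{1-\nu_{\mathrm{via}}}\,\bigl(\alpha_{\mathrm{via}}-\alpha_{\mathrm{Si}}\bigr)\,\Delta T .

    With typical handbook CTE values of roughly 17 ppm/K for copper, 2.6 ppm/K for silicon, and 1-2 ppm/K for Invar, the mismatch term (alpha_via - alpha_Si) shrinks by about an order of magnitude when copper is replaced by Invar, which is consistent with the order-of-magnitude stress reduction reported above.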

    Designing a Virtual Runtime for FPGA Accelerators in the Cloud

    No full text
    FPGAs can provide high performance and energy efficiency to many applications; therefore, they are attractive computing platforms in a cloud environment. However, FPGA application development requires extensive hardware design knowledge, which significantly limits the potential user base. Moreover, in a cloud setting, allocating a whole FPGA to a user is often wasteful and not cost-effective due to low device utilization. To make FPGA application development easier, we first propose a methodology that provides clean abstractions with high-level APIs and a simple execution model that supports both software and hardware execution. Second, to improve device utilization and share the FPGA among multiple users, we developed a lightweight runtime system that provides hardware-assisted memory virtualization and memory protection, enabling multiple applications to execute simultaneously on the device.
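
    Hardware-assisted memory virtualization and protection of the kind mentioned above can be pictured as a per-application translation table in front of the on-board memory. The C++ model below is illustrative only; the page size, table layout, and names (Translator, translate) are assumptions, not the paper's design.

    // Hedged sketch: per-application virtual-to-physical page translation with
    // a protection check; real hardware would raise a fault on a bad access.
    #include <cstdint>
    #include <iostream>
    #include <optional>
    #include <unordered_map>

    constexpr uint64_t kPageBytes = 2 * 1024 * 1024;  // assumed 2 MiB pages

    struct Translator {
        std::unordered_map<uint64_t, uint64_t> page_table;  // vpage -> ppage

        // Returns the physical address, or nothing if the application does
        // not own a mapping for this page.
        std::optional<uint64_t> translate(uint64_t vaddr) const {
            auto it = page_table.find(vaddr / kPageBytes);
            if (it == page_table.end()) return std::nullopt;
            return it->second * kPageBytes + vaddr % kPageBytes;
        }
    };

    int main() {
        Translator app_a{{{0, 7}, {1, 3}}};      // app A owns two physical pages
        std::cout << *app_a.translate(4096) << "\n";              // maps fine
        std::cout << app_a.translate(5 * kPageBytes).has_value()  // not owned: 0
                  << "\n";
    }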

    Capacitive inertial sensing at high temperatures of up to 400 degrees C

    No full text
    High-temperature-resistant inertial sensors are increasingly in demand in a variety of fields such as aerospace, automotive, and energy. Capacitive detection is especially suitable for sensing at high temperatures due to its low intrinsic temperature dependence. In this paper, we present high-temperature measurements utilizing a capacitive accelerometer, thereby proving the feasibility of capacitive detection at temperatures of up to 400 °C. We describe the observed characteristics as the temperature is increased and propose an explanation of the physical mechanisms causing the temperature dependence of the sensor, which mainly involve the temperature dependence of the Young's modulus and of the viscosity and pressure of the gas inside the sensor cavity. To this end, a static electromechanical model and a dynamic model that takes squeeze-film damping into account were developed.
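
    The mechanisms named above can be located in a generic lumped accelerometer model (standard textbook form with assumed symbols, not the paper's calibrated model):

        m\,\ddot{x} + c(T)\,\dot{x} + k(T)\,x = m\,a_{\mathrm{ext}}, \qquad C(x) = \frac{\varepsilon_0 A}{g - x},

    where the spring constant k(T) inherits the temperature dependence of the Young's modulus, the squeeze-film damping coefficient c(T) depends on the viscosity and pressure of the gas in the cavity, and the read-out capacitance C varies with the proof-mass displacement x. Softening of the Young's modulus with temperature shifts the static sensitivity, while the changing gas properties alter the dynamic response.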

    Through-Glass Vias for MEMS Packaging

    No full text
    Novelty / Progress Claims: We have developed a new method for the fabrication of through-glass vias (TGVs). The method allows rapid filling of via holes with metal rods in both thin and thick glass substrates. Background: Vertical electrical feedthroughs in glass substrates, i.e., TGVs, are often required in wafer-scale packaging of MEMS that utilizes glass lids. Current methods of making TGVs have drawbacks that prevent full utilization of the excellent properties of glass as a package material, e.g., low RF losses. Magnetic assembly has previously been used to fabricate through-silicon vias (TSVs), and in this work we extend this method to realize TGVs [1]. Methods: The entire TGV fabrication process is maskless and includes direct patterning of wafer metallization using femtosecond laser ablation, magnetic-field-assisted self-assembly of metal wires into via holes, and solder-paste jetting of bump bonds on the TGVs. Results: We demonstrate that: (1) the magnetically assembled TGVs have a low resistance, which makes them suitable even for low-loss and high-current applications; (2) the magnetic-assembly process can be parallelized in order to increase the wafer-scale fabrication speed; (3) the magnetic assembly produces void-free metal filling of the TGVs, which allows solder placement directly on top of the TGV for the purpose of high integration density; and (4) good thermal-expansion compatibility between TGV metals and glass substrates is possible with the right choice of materials, and several suitable metal-glass pairs are identified for possible improvement of package reliability [2]. [1] M. Laakso et al., IEEE 30th Int. Conf. on MEMS, 2017. DOI: 10.1109/MEMSYS.2017.7863517. [2] M. Laakso et al., “Through-Glass Vias for Glass Interposers and MEMS Packaging Utilizing Magnetic Assembly of Microscale Metal Wires,” manuscript in preparation.
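
    As a quick sanity check on claim (1), the DC resistance of a via filled with a solid metal rod follows the usual estimate (generic symbols, no values taken from the paper):

        R = \frac{\rho\,L}{\pi r^{2}},

    where rho is the resistivity of the wire metal, L the via length, and r the via radius. A void-free fill conducts over the full cross-section pi r^2, keeping R at the lower bound set by the bulk resistivity, whereas voids or partial plating shrink the effective conducting area and raise R.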
